Integrated Text and Image Understanding for Document Understanding
Authors
Abstract
Because of the complexity of documents and the variety of applications which must be supported, document understanding requires the integration of image understanding with text understanding. Our document understanding technology is implemented in a system called IDUS (Intelligent Document Understanding System), which creates the data for a text retrieval application and the automatic generation of hypertext links. This paper summarizes the areas of research during IDUS development where we have found the most benefit from the integration of image and text understanding.
1. INTRODUCTION
As more and more of our daily transactions involve computers, we would expect the volume of paper documents generated to decrease. However, exactly the opposite is happening. Considerable amounts of information are still generated only in paper form. This, compounded by volumes of legacy paper documents still cluttering offices, creates a need for efficient methods for converting hardcopy material into a computer-usable form. However, because of the sophistication of applications requiring electronic documents (e.g., routing and retrieval) and the complexity of the documents themselves, it is not sufficient to simply scan and perform OCR (optical character recognition) on documents; deeper understanding of the document is needed. Comprehensive document understanding involves determining the form (layout), as well as the function and the meaning of the document. Document understanding is thus a technology area which benefits greatly from the integration of text understanding with image understanding. Text understanding is necessary to operate on the textual content of the document and image understanding is necessary to operate on the pixel content of the document. We have found great benefit from intertwining the two technologies instead of employing them in a pipeline fashion.
We expect that more sophisticated document applications in the future will require even closer knitting.
Our document understanding technology is implemented in a system called IDUS (Intelligent Document Understanding System, described in Section 2), which creates the data for a text retrieval application and the automatic generation of hypertext links. This paper summarizes areas of research during IDUS development where we have found the most benefit from the integration of image processing with natural language/text processing: document layout analysis (Section 3), OCR correction (Section 4), and text analysis (Section 4). We also discuss two applications we have implemented (Sections 5 and 6) and future plans in Section 7.
2. GENERAL IDUS SYSTEM DESCRIPTION
IDUS employs four general technologies (image understanding, OCR, document layout analysis and text understanding) in a knowledge-based cooperative fashion [1, 2]. The current implementation is on a SPARCstation™ II with the UNIX™ operating system using the C and Prolog programming languages. OCR is performed with the Xerox Imaging Systems ScanWorX™ Application Programmer's Interface toolkit. All features are accessible via an X-Windows™/Motif™ user interface.
After scanning the document page(s), IDUS works page-by-page, performing image-based segmentation to initially locate the primitive regions of text and nontext which are manipulated during the logical and functional analysis of the document. Each text unit's content is internally homogeneous in physical attributes such as font size and spacing style. The ASCII text associated with each block is found through OCR, and a set of features based both on text attributes (e.g., number of text lines, font size and type) and geometric attributes (e.g., location on page, size of block) is used to refine the segmentation and organize the blocks into proper logical groupings, i.e., "articles".
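The feature-based grouping described above can be illustrated with a minimal Python sketch. It is not the IDUS implementation, which the text says was written in C and Prolog in a knowledge-based fashion. The `Block` record, the `group_into_articles` function, the `headline_ratio` threshold, and the rule that a large-font block opens a new "article" are all assumptions made for illustration, as is the single-column top-to-bottom ordering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One OCR'd text block (hypothetical record, not the IDUS data model)."""
    text: str          # ASCII text returned by OCR
    x: float           # geometric attribute: left edge on the page
    y: float           # geometric attribute: top edge (0 = top of page)
    width: float       # geometric attribute: block width
    height: float      # geometric attribute: block height
    font_size: float   # text attribute
    num_lines: int     # text attribute


def group_into_articles(blocks: List[Block],
                        headline_ratio: float = 1.3) -> List[List[Block]]:
    """Group blocks into "articles" using an assumed heuristic:
    a block whose font size is at least headline_ratio times the
    median font size is treated as a headline and opens a new article."""
    if not blocks:
        return []
    sizes = sorted(b.font_size for b in blocks)
    median = sizes[len(sizes) // 2]
    # Assumption: single-column page, so visit blocks top-to-bottom,
    # breaking ties left-to-right.
    ordered = sorted(blocks, key=lambda b: (b.y, b.x))
    articles: List[List[Block]] = []
    for block in ordered:
        is_headline = block.font_size >= headline_ratio * median
        if is_headline or not articles:
            articles.append([block])      # start a new logical grouping
        else:
            articles[-1].append(block)    # attach to the current article
    return articles
```

A real system would combine many more cues (font type, block size, column position) and would revise the initial segmentation; the sketch only shows how text attributes and geometric attributes can feed one grouping decision.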
The ASCII text for each "article" is assembled in a proper reading order. During this process the column structure of the document is determined, and noise and nontext blocks are eliminated. A text processing component performs a linguistic analysis to extract key ideas from each article and then represent them by a semantic component, the case frame. Each "article" text is saved as part of the document corpus and may be retrieved through a query interface.
3. DOCUMENT LAYOUT ANALYSIS
Document layout analysis determines the intra- and interpage physical, logical, functional and topical organization of a document. Many applications (e.g., automatic indexing or database population) require some level of understanding of the document's textual components. However, it is also critical to discover the document layout structure for two reasons: (1) document layout attributes such as position of text and font specifications are key clues into the relative importance of textual content on a page and (2) understanding the layout
Similar resources
A new text mining method for extracting user context information to improve search engine result ranking
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions, both of which degrade the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...
Learning Document Image Features With SqueezeNet Convolutional Neural Network
The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
Active document versioning: from layout understanding to adjustment
This paper introduces a novel Active Document Versioning system that can extract the layout template and constraints from the original document and then automatically adjust the layout to accommodate new contents. “Active” reflects several unique features of the system: First, the need of handcrafting adjustable templates is largely eliminated through layout understanding techniques that can co...
Publication date: 1994